\title{Supervised Learning (Classification)}

\subsection{Supervised Learning (Classification)}

In supervised learning, the task is to infer hidden structure from
labeled data, comprised of training examples $\{(x_n, y_n)\}$.
Classification means the output $y$ takes discrete values.

We demonstrate with an example in Edward.
An interactive version with Jupyter notebook is available
\href{http://nbviewer.jupyter.org/github/blei-lab/edward/blob/master/notebooks/supervised_classification.ipynb}{here}.

\subsubsection{Data}

We use the
\href{https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/crabs.html}
{crabs data set},
which consists of morphological measurements on a crab species. We
are interested in predicting whether a given crab has the color form
blue (encoded as 0) or orange (encoded as 1). We use all the numeric features
in the dataset.
\begin{lstlisting}[language=Python]
from observations import crabs

data, metadata = crabs("~/data")
X_train = data[:100, 3:]
y_train = data[:100, 1]

N = X_train.shape[0]  # number of data points
D = X_train.shape[1]  # number of features

print("Number of data points: {}".format(N))
print("Number of features: {}".format(D))
\end{lstlisting}

\begin{lstlisting}
Number of data points: 100
Number of features: 5
\end{lstlisting}

\subsubsection{Model}

A Gaussian process is a powerful object for modeling nonlinear
relationships between pairs of random variables. It defines a distribution over
(possibly nonlinear) functions, which can be applied for representing
our uncertainty around the true functional relationship.
Here we define a Gaussian process model for classification
\citep{rasmussen2006gaussian}.

Formally, a distribution over functions $f:\mathbb{R}^D\to\mathbb{R}$ can be specified
by a Gaussian process
\begin{align*}
  p(f)
  &=
  \mathcal{GP}(f\mid \mathbf{0}, k(\mathbf{x}, \mathbf{x}^\prime)),
\end{align*}
whose mean function is the zero function, and whose covariance
function is some kernel which describes dependence between
any set of inputs to the function.

Given a set of input-output pairs
$\{\mathbf{x}_n\in\mathbb{R}^D,y_n\in\mathbb{R}\}$,
the likelihood can be written as a multivariate normal
\begin{align*}
  p(\mathbf{y})
  &=
  \text{Normal}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})
\end{align*}
where $\mathbf{K}$ is a covariance matrix given by evaluating
$k(\mathbf{x}_n, \mathbf{x}_m)$ for each pair of inputs in the data
set.

The above applies directly for regression where $\mathbb{y}$ is a
real-valued response, but not for (binary) classification, where $\mathbb{y}$
is a label in $\{0,1\}$. To deal with classification, we interpret the
response as latent variables which is squashed into $[0,1]$. We then
draw from a Bernoulli to determine the label, with probability given
by the squashed value.

Define the likelihood of an observation $(\mathbf{x}_n, y_n)$ as
\begin{align*}
  p(y_n \mid \mathbf{z}, x_n)
  &=
  \text{Bernoulli}(y_n \mid \text{logit}^{-1}(z_n)).
\end{align*}

Define the prior to be a multivariate normal
\begin{align*}
  p(\mathbf{z})
  &=
  \text{Normal}(\mathbf{z} \mid \mathbf{0}, \mathbf{K}),
\end{align*}
with covariance matrix given as previously stated.

Let's build the model in Edward. We use a radial basis function (RBF)
kernel, also known as the squared exponential or exponentiated
quadratic. It returns the kernel matrix evaluated over all pairs of
data points; we then Cholesky decompose the matrix to parameterize the
multivariate normal distribution.
\begin{lstlisting}[language=Python]
from edward.models import Bernoulli, MultivariateNormalTriL
from edward.util import rbf

X = tf.placeholder(tf.float32, [N, D])
f = MultivariateNormalTriL(loc=tf.zeros(N), scale_tril=tf.cholesky(rbf(X)))
y = Bernoulli(logits=f)
\end{lstlisting}
Here, we define a placeholder \texttt{X}. During inference, we pass in
the value for this placeholder according to data.

\subsubsection{Inference}

Perform variational inference.
Define the variational model to be a fully factorized normal.
\begin{lstlisting}[language=Python]
qf = Normal(loc=tf.get_variable("qf/loc", [N]),
            scale=tf.nn.softplus(tf.get_variable("qf/scale", [N])))
\end{lstlisting}

Run variational inference for \texttt{500} iterations.
\begin{lstlisting}[language=Python]
inference = ed.KLqp({f: qf}, data={X: X_train, y: y_train})
inference.run(n_iter=500)
\end{lstlisting}
In this case
\texttt{KLqp} defaults to minimizing the
$\text{KL}(q\|p)$ divergence measure using the reparameterization
gradient.
For more details on inference, see the \href{/tutorials/klqp}{$\text{KL}(q\|p)$ tutorial}.
(This example happens to be slow because evaluating and inverting full
covariances in Gaussian processes happens to be slow.)

% \subsubsection{Criticism}

\subsubsection{References}\label{references}
